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Abstract 

The scalability of statistical estimators is of increasing importance in modern applications. One 
approach to implementing scalable algorithms is to compress data into a low dimensional latent 
space using dimension reduction methods. In this paper we develop an approach for dimension 
reduction that exploits the assumption of low rank structure in high dimensional data to gain both 
computational and statistical advantages. We adapt recent randomized low-rank approximation 
algorithms to provide an efficient solution to principal component analysis (PCA), and we use this 
efficient solver to improve parameter estimation in large-scale linear mixed models (LMM) for 
association mapping in statistical and quantitative genomics. A key observation in this paper is 
that randomization serves a dual role, improving both computational and statistical performance 
by implicitly regularizing the covariance matrix estimate of the random effect in a LMM. These 
statistical and computational advantages are highlighted in our experiments on simulated data and 
large-scale genomic studies. 

Keywords: dimension reduction, generalized eigendecompositon, low-rank, genomics, linear 
mixed models, supervised, random projections, randomized algorithms, Krylov subspace methods 


1. Introduction 

In the current era of information, large amounts of complex high dimensional data are routinely 
generated across science and engineering disciplines. Estimating and exploring the underlying low¬ 
dimensional structure in data and using this latent structure to study scientific problems is of funda¬ 
mental importance in a variety of applications. As the size of the data sets increases, the problem of 
statistical inference and computational feasibility become inextricably linked. Dimension reduction 
is a natural approach to summarizing massive data and has historically played a central role in data 
analysis, visualization, and predictive modeling. It has had significant impact on both statistical 
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inference ( 

AdcockI 1878[ Edegworlhj 1884t Eisher| 1922 Holellingl 1933 

Young 

1941 

1 , as well 

as on numerical analysis research and applications (Golub 

1969[ Golub and Van Eoan| 1996[ Gu 

and Eisensfaf 1996UGolub el aLj 20001; for a recenl review see (Mahoney| 201 1|). Historically, 


statisticians have focused on the study of theoretical properties of estimators often in the context 
of asymptotically large number of samples. Numerical analysts and computational mathematicians, 
on the other hand, have been instrumental in the development of useful and tractable algorithms 
with provable stability and convergence guarantees. Naturally, many of these algorithms have been 
successfully applied to compute estimators grounded on solid statistical foundations. A classic 
example of this interplay is principal component analysis (PC A) ( Hotelling 1933| ). In PC A, an 
objective function is defined based on statistical considerations about the sample variance in the 
data, which can then be efficiently computed using a variety of singular value decomposition (S VD) 
algorithms developed by the numerical analysis community. 

In this paper, we consider the problem of dimension reduction, focusing on the integration 
of statistical considerations of estimation accuracy and out-of-sample prediction error of matrices 
with latent low-rank with computational considerations of run time and numerical accuracy. The 
methodology that we develop builds on a classical approach to modeling large data, which first com¬ 
presses the data, minimizing the loss of relevant information, and then applies statistical estimators 
appropriate for small-scale problems. In particular, we focus on dimension reduction via general¬ 
ized eigendecomposition as the means for data compression, and on out-of-sample residual error as 
the measure of information loss. The scope of this work includes applications to a large number 
of dimension reduction methods, which can be implemented as solutions to truncated generalized 
eigendecomposition problems (Hotelling 1933} Fisher[ 1936] Li 1991 Wuetal. 20101. In this pa¬ 
per our first focus is on the increasing need to compute an SVD of massive data using randomized 


algorilhms developed in fhe numerical analysis communily (Drineas el al. 

2006 

Sartosj 20061 |Lib2 

erly el al.| 2007[ Boufsidis el al.| 2009^ |Rokhlin el al. 

2009||Halko el al.[ 

20111 to simullaneously 


reduce the dimension and regularize or control the impact of independent random noise. 

The second focus in this paper is to provide efficient solvers for the linear mixed models that 
arise in statistical and quantitative genomics. In high-throughput genomics experiments, a vast 
amount of sequencing data is collected—on the order of tens of millions of genetic variants. The 
goal of genome-wide association studies is to test for a statistical association at each variant (poly¬ 
morphic position) to a response of interest (e.g., gene expression levels or disease status) in a sample 
cohort. However, as the size of these genomic data and sample sizes continue to increase, there is 
an urgent need to improve the statistical and computational performance of standard tests. 

It is typical to collect several thousand individuals for one study. These individuals may come 


from several genetically heterogeneous populations. It has been recognized since 2001 (Pritchard 


and Donnelly! 200 1|) that the ancestry makeup of the individuals in the study has great potential 


to influence sfudy resulfs—in parficular, spurious associafions arise when genetic varianfs wifh 
differenfial frequencies may appear fo be associated wifh fhe biased response variable via lafenf 
populafion sfrucfure. 

The earliesf mefhods (e.g., genomic confrol) accounted for population sfrucfure by using co- 
variafe esfimafes fo correcf for fhese confounding signals. More recenfly, linear mixed models have 
been used successfully fo correcf spurious resulfs in fhe presence of populafion sfrucfure. LMMs 


have been shown fo improve power in association sfudies while reducing false posifives (Yang el al. 


20141. However, mixed models incur a high compufafional cosf when performing associafion sfud- 


ies because of fhe compufafional burden of computing a covariance mafrix for fhe random effecl 
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controlling for population structure. Significant work has gone into mitigating such costs using 
spectral decompositions for efficient covariance estimation (Kang et al. 2008 2010| Yang et al. 
201 H Zhou and Stephen^ 2012| Listgarten et al. 20121. 


In this work, we show, using simulations of genomic data with latent population structure and 
real data from large-scale genomic studies that adaptive randomized SVD (ARSVD) developed 
here performs with both computational efficiency and numerical accuracy. Under certain sellings, 
we find lhal Ihe LMM using ARSVD oulperforms currenl slale-of-lhe-art approaches by implicifly 
performing regularizafion on Ihe covariance mafrix. 

There are Ihree key conlribulions of Ihis paper: 


(i) We develop an adaptive algorilhm for randomized singular value decomposilion (SVD) in 
which bofh fhe number of relevanl singular vectors and fhe number of ileralions of fhe algo¬ 
rilhm are inferred from fhe dala based on informalive sfalisfical crileria. 


(ii) We use our adaptive randomized SVD (ARSVD) algorilhm fo consfrucl Iruncaled generalized 
eigendecomposilion eslimafors for PC A and linear mixed models (LMMs) ( Lisfgarfen el al.[ 
|2012t Zhou and Sfephen^|2012 1. 


(iii) We demonslrale on simulaled and real dala examples lhal Ihe randomized estimators provide 
a compulalionally efficienl solution, and, furthermore, often improve slalislical accuracy of 
Ihe predictions. We show lhal in an over-paramelrized silualion Ihis improvemenl in accuracy 
is due to implicil regularization imposed by Ihe randomized approximation. 


In Section]^ we describe Ihe adaptive randomized SVD procedure we use for Ihe various dimension 
reduction melhods. In Section [23| we provide randomized estimators for linear mixed models used 
in quanlilalive genetics. In Section we validate Ihe proposed melhodology on simulated and real 
dala and compare our approach wilh slale-of-lhe-art approaches. In particular, we show resulls from 
our approach for estimating low dimensional geographic slruclure in genomic dala and for genetic 
association mapping applications. 


2. Randomized algorithms for dimension reduction 


In Ihis section we develop algorilhmic extensions for PCA. We slate an algorilhm lhal provides a 
numerically efficienl and slalislically robusl estimate of Ihe highesl variance directions in Ihe dala 


using a randomized algorilhm for singular value decomposition (Randomized SVD) (Rokhlin el al. 


2009 Halko el ah] 201 1|). In Ihis problem, Ihe objective is linear unsupervised dimension reduction 


wilh Ihe low-dimensional subspace estimated via an eigendecomposilion. Randomized SVD will 
serve as Ihe core compulalional engine for Ihe olher estimators we develop in Ihis paper. 


2.1 Notation 

Given positive integers p and d wilh p ^ d, denotes Ihe class of all malrices wilh real enlries 
of dimension p x d, and SS^_^_ (SS^) denotes Ihe sub-class of symmelric positive definite (semi- 
definile) p x p malrices. For B G span(fl) denotes Ihe subspace of MP spanned by Ihe 

columns of B. A basis matrix for a subspace S is any full column rank malrix B G MP^'^ such 
lhal S = span(i?), where d = dim(5). In Ihe case of sample dala X, eigen-basis(Y) denotes Ihe 
orlhonormal left eigenvector basis. Denote Ihe dala malrix by Y = (xi,..., Xn)"’" G where 
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each sample Xi is a assumed to be generated by a p-dimensional probability distribution . In the 
case of supervised dimensions reduction, denote the response vector to be y G M” (quantitative 
response) or y G {1,..., C} (categorical response with C categories), and Y ~ Vy The data and 
the response are assumed to have a joint distribution (X, y) ~ Vxxv 


2.2 Computational considerations 


The main computational tool we use is a randomized algorithm for approximate eigendecompositon, 
which factorizes an x p matrix of rank r in time 0{npr) rather than the 0{np x min(n,p)) time 
required by approaches that do not take advantage of the special structure in the input. This is 
particularly relevant to statistical applications in which the data is high dimensional but reflects a 
highly constrained process, e.g., from biology or finance applications, which suggests it has low 
intrinsic dimensionality, i.e., r <C n < p. An appealing characteristic of our randomized algorithm 
is the explicit control of the trade-off between estimation accuracy relative to the exact estimates 
and computational efficiency. Rapid convergence to the exact estimates was shown both empirically 


as well as in theory in (Rokhlin et al.[ 20091. 


2.3 Statistical considerations 

A central concept in this paper is that the randomized approximation algorithms that we use for 
statistical inference implicitly impose regularization constraints. This idea is best understood by 
putting the randomized algorithm in the context of a statistical model. The numerical analysis and 
theoretical computer science perspective is deterministic and focuses on optimizing the discrepancy 
between the randomized and the exact solution, which is estimated and evaluated on a fixed data 
set. However, in many practical applications, the data are noisy random samples from a popula¬ 
tion. Hence, when the interest is in dimension reduction, the relevant error comparison is between 
the true subspace that captures information about the population and the corresponding algorithmic 
estimates. The true subspace is typically unknown, so we can validate the estimated subspace by 
quantifying the out-of-sample generalized performance. A key parameter of the randomized esti¬ 
mators, described in detail in Section |2.4[ is the number of power iterations used to estimate the 
span of the data. Increasing values for this parameter provide intermediate solutions to the factor¬ 
ization problem, which converges the to the exact answer ( Rokhlin et al.[[2009 1. Fewer iterations 
correspond to reduced run time but also to larger deviation from the sample estimates and hence 
stronger regularization. We show in Section]^ the regularization effect of randomization on both 
simulated and real data sets. We argue that, in certain contexts, fewer iterations may be justified 
both by computational and statistical considerations. 


2.4 Adaptive randomized low-rank approximation 

In this section we provide a brief description of a randomized estimator for the best low-rank matrix 
approximation, introduced by ( [Rokhlin et al.[ 2009 Halko et al. 20111, which combines random 
projections with numerically stable matrix factorization. We consider this numerical framework as 
implementing a computationally efficient shrinkage estimator for the subspace capturing the largest 
variance directions in the data, particularly appropriate when applied to input matrix of (approx¬ 
imately) low rank. Detailed discussion of the estimation accuracy of Randomized SVD in the 
absence of noise is provided in ([Rokhlin et al.[ [2009]). The idea of random projection was first 
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developed as a proof technique to study the distortion induced by low dimensional embedding of 


high-dimensional vectors in the work of ([Johnson and Lindenstrauss| 1984 

1 with much literature 

simplifying and sharpening the results ( 

Frankl and Maeharal 1987[ Indyk and Motwani 

1998 

Das- 

gupta and Guptaj 2003 [ Achlioptas| |2001 

1. More recently, the theoretical computer science and 


the numerical analysis communities discovered that random projections can be used for efficient 


approximation algorithms for a variety of applications (Drineas et al. 

2006 

Sarlos 

2006 

Fiberty 

|et al.[|2007t|Boutsidis et al.||2009 

Rokhlin et al.| |2009t |Halko et al. i 

101 ll. 

Ve focus on one such 

the accurate low-rank 

approach proposed by (Rokhlin et al. 

2009 

Halko et al.l 20111 which targets 


approximation of a given large data matrix X G In particular, we extend the randomization 

methodology to the noise setting, in which the estimation error is due to both the approximation 
of the low-rank structure in X, as well as the added noise. A simple working model capturing this 
scenario is as follows: X = Xd* + E, where Xd* G rank (Xd*) = d* captures the low 

dimensional signal, while E is independent additive noise. 

Algorithm. Given an upper bound on the target rank dmax and on the number of necessary power 
iterations f^ax (fmax G {5, ... 10} is sufficient in most cases), the algorithm proceeds in two stages: 
(1) estimate a basis for the range of Xd*, (2) project the data onto this basis and apply SVD: 


Algorithm: Adaptive Randomized SVD{X, fmax> f^max> 


(1) Find orthonormal basis for the range of 

(i) Set the number working directions: £ = d 

(ii) Generate random matrix: Q G with Qij ~ 


max “h 
iid 


N(0,1); 

(hi) Construct blocks: = XX^F^-^') with = fl for t G {1,..., tmaxj; 

(iv) Select optimal block f* G {1,..., fmaxj and rank estimate d* G {1,..., dmax}> using a 
stability criterion and Bi-Cross-Validation (see Section 2.4.11; 

(v) Estimate basis for selected block using SVD: F^**) = QR G Q^Q = /; 


(2) Project data onto the range basis and compute SVD; 

(i) Project onto the basis: B = X^Q G 

(ii) Factorize: B UEW'^, where E = diag(cji,..., ur); 

(iii) Rank d* approximation: Xd* = Ud* Xd* Vjt 

Ud* = {Ui\...\Ud*)^W^^'^* 

Xd* = diag(cri, ...,ad*)GR'^*'^'^* 

Vd* = Qx (lVi|...|lVrf*) 


In stage (1) we set the number of working directions i = dmax -|- A to be the sum of the upper 
bound on the rank of the data, dmax and a small oversampling parameter A, which ensures more 
stable approximation of the top dmax sample variance directions; the estimator tends to be robust to 
changes in A, so we use A = 10 as a suggested default. In step (l.iii) the random projection matrix 
D is applied to powers of XX"^ to randomly sample linear combinations of eigenvectors of the data 
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weighted by powers of the eigenvalues: 

= {xx^fn = = us^^n* 


where X "= USV^. 


nxi 


The purpose of the power iterations is to inerease the decay of the noise portion of the eigen- 
spectrum while leaving the eigenvectors unchanged. This is effectively a form of regularization 
as we are shrinking eigenvalues. Note that each column j of corresponds to a draw from a 
77,-dimensional Gaussian distribution: ~ N{0,US^^U'^), with the covariance structure more 

strongly biased towards higher directions of variation as t increases. This type of regularization is 
analogous to the local shrinkage term developed in prior work ( Poison and Scott[ 2010). In step 
(iv), we select an optimal block F^**) for F G {1,..., fmax} and estimate an orthonormal basis for 
the column space. In previous work ( [Rokhlin et al. 2009| , the authors assumed fixed target rank 
d* and approximated X, rather than X^*. They showed that the optimal strategy for this purpose is 
to set t* = ^max^ which typically achieves excellent d*-rank approximation accuracy for X, even 
for relatively small values of tmax- In this work we focus on the noisy case, where F / 0, and 
propose to adaptively set both d* and t *, aiming to optimize the generalization performance of the 


randomized estimator. The estimation strategy for d* and t* is described in detail in Section 2.4.1 


In stage (2) we rotate the orthogonal basis Q computed in stage (1) to the canonical eigenvector 
basis and scale according to the corresponding eigenvalues. In step (2.i) the data is projected onto 
the low dimensional orthogonal basis Q. Step (2.ii) computes the exact SVD in the projected space. 

Computational complexity. The computational complexity of the randomization step is O {np x 
dmux X fmax) and the factorizations in the lower dimensional space have complexity 0{np x dmax + 
n X dmax)- With dmax Small relative to n and p, the run time in both steps is dominated by the 
multiplication by the data matrix; in the case of sparse data, fast multiplication can further reduce 
the run time. We use a normalized version of the above algorithm that has the same run time 


complexity but is numerically more stable (Martinsson et al.[ Vancouver, Canada, 2010 l. 


2.4.1 Adaptive method to estimate d* and t* 


We propose to use ideas of stability under random projections in combination with cross-validation 
to estimate the intrinsic dimensionality of the dimension reduction subspace d* as well as the opti¬ 
mal value of the eigenvalue shrinkage parameter t*. 


Stability-based estimation of d *: First, we assume the regularization parameter t is fixed and 
describe fhe esfimafion of fhe rank paramefer d*{t), using a sfabilify criferion based on random 
projecfions of fhe dafa. We sfarf wifh rough upper-bound esfimafe dmax for fhe dimension paramefer 
d*. We fhen apply a small number (B = 5) of independenf Gaussian random projecfions 12^^^ G 

for 6 G B}. Given fhe projecfions we compufe an esfimafe of 

fhe eigenvecfor basis of fhe column space onfo fhe projecfed dafa. We fhen denoise fhe esfimafe by 
raising all fhe eigenvalues fo fhe power t: 

... \ul^J) = SYDliXX^fn^^^] for 6 G {1,..., B}. 

The /c-fh principal basis left singular vector estimate (fc G {1,..., d}) is assigned a stability score: 


stab(f, k, B) 


1 

N 


B-l B 


E E 


cor 




where N = 


2 
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Here is the estimate of the principal eigenvector of X based on the r-th random pro¬ 
jection and cor denotes the Spearman rank-sum correlation between and 

Eigenvector directions that are not dominated by independent noise are expected to have higher 
stability scores. When the data has approximately low-rank we expect a sharp transition in the 
eigenvector stability between the directions corresponding to signal and to noise. In order to esti¬ 
mate this change point, we apply a non-parametric location shift test (Wilcoxon rank-sum) to each 
of the dmax — 2 stability score partitions of eigenvectors with larger versus smaller eigenvalues. The 
subset of principal eigenvectors that can be stably estimated from the data for the given value of t is 
determined by the change point with smallest p-value among all dmax — 2 non-parametric tests. 

dt = argmin p-value(/c, f) 


where p-value(k,t) is the p-value from the Wilcoxon rank-sum test applied to the {stab(f, i, 
and {stab(f, i, 


Estimation of t* : In this section we describe a procedure for selecting an optimal value for t G 
fmax} based on the reconstruction accuracy under Bi-Cross-Validation for SVD (Owen and 


[Perry 2009[ ), using the generalized Gabriel holdout pattern ( Gabriel[ 20021. The rows and columns 
of the input matrix are randomly partitioned into r and c groups respectively, resulting in a total of 
r X c sub-matrices with non-overlapping entries. We apply Adaptive Randomized SVD to factorize 
the training data from each combination of (r — 1) row blocks and (c — 1) column blocks. In each 
case the submatrix block with the omitted rows and columns is approximated using its modified 
Schur complement in the training data. The cross-validation error for each data split corresponds 
to the Frobenius norm of the difference between the held-out submatrix and its training-data-based 
estimate. For each sub-matrix approximation we estimate the rank d* = d{t) using the stability- 
based approach from the previous section. As suggested in previous work ( Owen and Perry[[2009 1, 
we fix c = r = 2, in which case Bi-Cross-Validation error corresponding fo holding ouf fhe lop 


lefl block A of a given block-parlifioned malrix 


C D 


becomes 


\A-BD+C\\l^. 


Here = 


VS~^U'^ is fhe Moore-Penrose pseudoinverse of D USV'^. For fixed value of t we estimate 
d{t) and factorize D~^ using Adaptive Randomized SVD(t, d(t), A = 10) and denote fhe Bi-Cross- 
Validafion error by BiCV(t, A). The same process is repealed for fhe ofher fhree blocks B, C, and 
D. The final error and rank estimates are defined fo be fhe medians across all blocks and are denoled 
as BiCV(f) and dt, respeclively. We oplimize over fhe full range of allowable values for t fo arrive 
af fhe final eslimales 


^BiCV 

^BiCV 


arg mm 

ISflviImax} 


BiCV(f) 


2.5 Fast linear mixed models 

Mullivariale linear mixed models (FMMs) have been a workhorse in slalislical genetics and quan- 
filafive genetics because Ihey allow for fhe regression of explanalory on outeome variables while 
modeling relatedness befween samples (Henderson 1984[ ^ce ef al.[[2011 Krofe et al. 20121. In 
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the context of mapping traits LMMs are used to correct for unobserved sources of statistical con¬ 
founding in genetic association studies—particularly the presence of population structure amongst 
samples. 

The linear mixed models in this paper take the form 

y = x/3 + Z/r + e, 

where y is the response, x are the covariates, /? are the fixed effects, Z is the random effects design 
matrix, and and p captures unobserved covariates with var(p) = agK and var(e) = a^I. Here, K 
is the estimated kinship or genetic relatedness matrix, and is the proportion of variance in the 
phenotypes explained by genetic factors. In the standard application to genetic association studies, 
the response y is the quantitative trait across individuals and x is a p-vector of genotypes of those 
individuals across many genomic locations. The statistical test then is to determine whether the 
coefficient for a particular genetic variant, Pj, is equal to zero, indicating no genetic association, or 
alternatively whether / 0, indicating support for a possible genetic association at this genetic 
locus with the trait. The likelihood for this model follows from a multivariate normal distribution: 


V = 


N(X/3,V) 


I 


T^r-1 


Ky) = --[nlog(27r)+log|H| + (y-X/3) H 


iy-xf3)]. 


The standard procedure to correct for sample (population) structure using an LMM proceeds in 
three main steps: (1) empirical estimation (construction) of the genetic relatedness matrix (GRM) 
amongst individuals based on held-out genetic data; (2) estimation of variance components to ac¬ 
count for sample structure contributing to response; and (3) computation of an association statistic 
at each genotype position. 

Many methods have focused on reducing the complexity of one or more of these steps to scale 
LMMs to current study sizes. The EMMA method ( Kang et 2008 | l uses spectral decomposition 
to jointly estimate the likelihood of the response and random effects together instead of obtaining 
estimates of random effects a priori, which incurs a greater computation cost. EMMAX ( |Kang 
et al. 2010| ) improves upon EMMA by estimating variance components only once (independent of 
the number of genetic variants tested), making the assumption that most variants have a small effect 
on the phenotype (the pylmm software used here, along with our adaption, is based on EMMAX). 


EaST-EMM ( [Eippert et al.[|20lT[ ), in addition to using a heuristic low-rank spectral decomposition 
to construct an orthogonal representation of the GRM, uses rotations of the genetic variants and 


response to perform computation of the likelihood function in linear time. Eurther methods (Eippert 


et al.[ 20131 use more systematic methods for subsetting genetic variants that are relevant to the 


phenotype of interest in order to produce accurate variance component estimations in a EMM. 

Our contribution to accelerating parameter estimates in this model is to use ARSVD to address 
the complexity of estimating the design matrix Z. In general, the GRM is computed using a sample- 
by-sample covariance of the design matrix (nxp), which has a computational complexity of 0{np^) 
to build, where typically p dominates n in genomic studies. Decomposing this matrix into an exact 
orthogonal representation using an eigendecomposition has complexity 0(n^) in the worst case. We 
forgo explicit computation of the GRM, and instead perform ARSVD on the original design matrix, 
with a computational complexity of 0{np x dmax + n x dmax)- using ARSVD to decompose 
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the design matrix, we automatically detect low-rank structure present in the GRM, and we avoid the 
need to manually or heuristically subset the data to achieve a low-rank representation. We note in 
the simulated data and in the genomic study results that this has both a substantial acceleration in the 
computational speed and also the implicit regularization of the design matrix Z from our approach 
improves the type I error substantially. 


3. Results on simulated data 

We use real and simulated data to demonstrate three major contributions of this work: 

1. In the presence of interesting low-rank structure in the data, the randomized algorithms tend 
to be much faster than the exact methods with minimal loss in approximation accuracy. 

2. The rank and the subspace containing the information in the data can be reliably estimated 
and used to provide efficient solutions to dimension reduction methods based on the truncated 
(generalized) eigendecomposition formulation. 

3. The randomized algorithms implicitly impose regularization, which can be adaptively con¬ 
trolled in a computationally efficient manner to produce improved out-of-sample perfor¬ 
mance. 


3.1 Unsupervised dimension reduction 


We begin with unsupervised dimension reduction of data with low-rank structure contaminated with 
Gaussian noise, and we focus on evaluating the application of Adaptive Randomized SVD for PGA 
(see Section 2.4 1 . In particular, we demonstrate that the proposed method estimates the sample 
singular values with exponentially decreasing relative error in t. Then we show that achieving 
similar low-rank approximation accuracy to a state-of-the-art Lanczos method requires the same 
run time complexity, which scales linearly in both dimensions of the input matrix. This makes our 
proposed method applicable to large data matrices. Lastly, we demonstrate the ability to adaptively 
estimate the underlying rank of the data, given a coarse upper bound. For the evaluation of the 
randomized methods, based on all our simulated and real data, we assume a default value for the 
oversampling parameter A = 10. 


Simulation setup The data matrix X G is generated as follows: X = USV'^ E, where 

U'^U = V'^V = Id*- The d* columns of U and V are drawn uniformly at random from the corre¬ 
sponding unit sphere and the singular values S = diag(si,..., s^*) are randomly generated starting 
from a baseline value, which is a fraction of the maximum noise singular value, with exponential 
increments separating consecutive entries: 


Sj = Sj-i -h Uj, for j € {2,..., d*} 

Vj ~ Exp(l), Uq = K X 

The noise is iid Gaussian: Eij ~ A (O, ^). The sample variance has the SVD decomposition 
E UeSeV^, where Se = diag ... , p)) singular values in decreasing 

order. The signal-to-noise relationship is controlled by k, large values of which correspond to 
increased separation between the signal and the noise. 
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Results In our first simulation we set k = 1 and assume the rank d* to be given and fixed to 
50. The main focus is on the effect of the regularization parameter t controlling the singular value 
shrinkage, with larger values corresponding to stronger preference for the higher versus the lower 
variance directions. The simulation uses an input matrix of dimension 2,000 x 5,000. Studying the 
estimates of the percent relative error of the singular values averaged over ten random data sets, we 
observed exponential convergence to the sample estimates with increasing t (Table[T]l. This suggests 
that the variability in the sample data directions is approximated well for large data sets at the cost 
of few data matrix multiplications. 


t 

1 

2 

3 

4 

5 

% relative error 

26.1 ±0.2 

8.8 ±0.2 

3.0 ±0.1 

1.0 ±0.1 

0.3 ±0.0 

log^ofmean] 

4.7 

3.1 

1.6 

-0.1 

-1.9 


Table 1: Singular value estimation using Adaptive Randomized SVD. We report the mean 
relative error (±1 standard error) averaged across ten random replicates of dimension 
2, 000 X 5,000, with rank 50. Linear increase in the regularization parameter t results 
in exponential decrease in the percent relative error. 


In addition to minor relative error compared with exact approaches, we further investigate the 
source of putative error (Figure [^. Data were generated from n = 1, 000 and p = 1, 000 with 
true rank d* = 50. We note our method is extremely precise for the greatest singular values but 
has a slightly increased decay rate of singular values compared to exact methods, where less overall 
variance is captured in directions present in the data that are more influenced by noise. Our method 
has larger standard error and tends to slightly underestimate the smallest singular values, which we 
interpret as providing regularization to directions of noise, as further explored in Section 
Similar ideas have been proposed before in the absence of randomization. Perhaps the most well- 
established and computationally efficient methods are based on Lanczos Krylov Subspace estima¬ 
tion, which also operate on the data matrix only through matrix multiplies (|Saad[ 1992t Lehoucq 


et al.[ 1998 Stewart[ 2001; Baglama and Reich^|2006 1. These methods are also iterative in nature 


and tend to converge quickly with run time complexity, typically scaling as the product of the data 
dimensions 0{qnp), with q small. 

In order to further investigate the practical run time behavior of Adaptive Randomized SVD we 


used one such state-of-the art low-rank approximation algorithm. Blocked Lanczos (Baglama and 


Reichel[ 20061. Here we use the same simulation setting as before with fixed d* = 50, but we 


vary the regularization parameter t to achieve comparable percent relative low-rank Frobenius norm 
reconstruction error (to 1 d.p.). In Tablej^we report the ratio of the run times of the two approaches 
based on ten random data sets. Notice that the relative run time remains approximately constant 
with simultaneous increase in both data dimensions, which suggests similar order of complexity 
for both methods. In the presence of low-rank structure. Adaptive Randomized SVD is feasible 


for larger data problems where full-factorization methods (e.g. (Anderson et al. 19901) are not; 


moreover. Adaptive Randomized SVD achieves good approximation accuracy for the top sample 
singular values and singular vectors. 

In many modem applications the data have latent low-rank structure of unknown dimension. In 
Section |2.4. 1[ we address the issue of estimating the rank using a stability-based approach. Next, 


10 




























Adaptive Randomized Dimension Reduction on Massive Data 



0 10 20 30 40 50 


Singular value rank 


Figure 1: Singular value accuracy of ARSVD. Simulation results from random data sets of di¬ 
mension n = 1, 000 and p = I, 000. Singular values (ai) obtained from exact SVD are 
point estimates in red. Singular value confidence intervals are twice the standard error in 
estimates from ten estimations using ARVSD. 


n - 1 - p 

6,000 

7,500 

9,000 

10,500 

12,000 

relalive time 

2.5 ±0.05 

1.84 ±0.03 

1.82 ±0.03 

1.83 ±0.02 

1.84 ±0.04 


Table 2: Run time ratio between Adaptive Randomized SVD and Lanczos. Results are across 
different data dimensions n and p and rank d* = 50. The Lanczos implementation is from 
the irlba CRAN package ( Baglama and Reichel[ 2006 1 . The Adaptive Randomized SVD 
was run until the percent relative low-rank reconstruction error was equal to the Lanczos 
error to 1 d.p. For each consecutive simulation scenario n is incremented by 500 and p is 
incremented by 1, 000. We report the sample mean and standard error based on ten random 
replicate data sets. 


we studied the ability of our randomized method to perform rank estimation to identify the di¬ 
mensionality in simulated low-rank data. For that purpose, we generated 50 random data sets, 


n = 1,000,p = 1,000, d* ~ Uniform[10, 50], and k = 1,2. We set an initial rank upper bound 
estimate to be d* * 2 and used the adaptive method ARSVD (Section 2.4.1 1 to estimate both optimal 
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t* and the corresponding d*. Figure plots the estimated versus the true rank and the correspond¬ 
ing estimates of the regularization parameter t* for two different signal-to-noise scenarios. In both 
scenarios, the rank estimates showed good agreement with the true rank values. When k = 1 the 
smallest “signal” direction has the same variance as the largest variance “noise” direction, which 
causes our approach to slightly underestimate the largest ranks. This is due to the fact that the few 
smallest variance signal directions tend to be difficult to distinguish from the random noise and 
hence less stable under random projections. We observe that our approach tends to select small val¬ 
ues for t*, especially when there is a clear separation between the signal and the noise (right panel 
of Figure]^. This results in a reduced number of matrix multiplications and hence fast computation. 


3.2 Results on simulated data with latent population structure 


To simulate a genomic association study with latent population structure, we generate genotypes 
and phenotypes that are conditionally independent given the true population structure, as described 
in ( Mimno et al.[ 20141. Each individual’s genotype is admixed, containing a proportion of an¬ 
cestry from each population 1,..., iT. The admixture proportions for each individual i, are drawn 
9i ~ DirK{a). Since we only consider two alleles at each position in the genome, the frequency 
of each variant j = 1,..., P in population k is drawn ~ Beta{l, 1) representing a uniform 
allele frequency. To generate latent assignments of each variant to a population, we create variant- 
population assignments by generating zi^ij and Z 2 ,ij from Mult{9i). The final generafion of geno- 
fypes draws each varianf j from individual i as Bin{(f)j, k). To generafe a phenofype (response) 
vector for fhe k^^ populafion, we draw y* ~ Bern{0.59k,i + 0.1(1 — 9k,i))- 


Resulfs are generafed for four disfincf scenarios representing dafa wifh differing sfrucfure. The 
simulations presenfed here have no frue associafions befween genofypes (fealures) and phenofypes 
(response), buf are dependenf conditional on lafenf populafion sfrucfure. This sifuafion mimics fhe 
commonly observed confounding effecf in genomic dafa, and mofivafes fhe need for LMMs and 
regularization in genome-wide studies. Given the simulations are entirely under the null hypothesis 
of no association between genotype and phenotype, it is expected that p-values follow a uniform 
distribution; we consider any result exceeding some a-threshold a false positive, occurring at rate 
a. 


In each simulation, the p-value distribution from an EMM association without ARSVD is re¬ 
ported as the full data matrix rank (the largest rank on the x-axis). The top left simulation uses 
n = 1,000 (samples or genetic variants) and p = 1,000 (responses or values of the phenotype), 
representing a situation with few samples. Our method performs some slight regularization beyond 
that of a standard EMM with rank close to full data rank (EigureQ. The next simulation (top right, 
n = 1, 000 and p = 5, 000) is the most common case in genomics with p 3> n, or a large set 
of features compared to responses. Our method performs equally well as current methods but at a 
fraction of the computational cost, as illustrated by the lowest-rank estimation that still yields uni¬ 
form p-values and low false positive rate. In the bottom left panel we generate a case of an over 
parametrized model (n = 5,000 and p = 1,000), for which our method produces significant false 
positives compared to a standard EMM; however, we note that if we know a priori that our model 
is over parametrized, we can modify our decomposition to perform in the sample space instead of 
feature space and achieve comparable results to a standard EMM. In the last setting (bottom right, 
n = 5,000 and p = 5, 000) again our method performs similarly to the standard method but with 
gains in computational efficiency. In general, we expecf the tradeoff between computational effi- 
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Figure 2: Rank estimation of ARSVD. Simulation results from 50 random data sets of dimension 
n = 10,000 andp = 10,000, with true rank d* ~ Uniform[10, 50]. Panel A,C: True rank 
{d*) on the x-axis, and estimated rank {d*) on the y-axis for signal-to-noise-parameters 
K = 1, K = 2, respectively. Panel B,D: estimations of t* with max t set to ten for 
signal-to-noise parameters k = 1, k = 2, respectively. 


ciency and accuracy to be dependent on the data analyzed. In the case of structured genomic data, 
we report in these simulations and in Sectionj^that massive computational savings are accompanied 
by numerical accuracy. 
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io“- 

103.5 ^q 4.5 ^g5 

Number of features (p) 


Figure 3: CPU time for three matrix decomposition methods with respect to number of fea¬ 
tures. Simulation results from pseudo-random data sets with range of number of features, 
p = 2,000 to 100,000 (rsvd), 80,000 (eig), and 40,000 (svd). The number of samples, n 
is one tenth the number of features. Run time is reported in terms of CPU-time (sec¬ 
onds) versus total number of features, p. Both axes scale with the log of the axis metric. 
SVD is performed on the complete n x p matrix. Eigendecomposition is performed with 
the sample covariance, that is, after multiplying design matrix X by and performing 
a decomposition on the resulting n x n matrix. RSVD estimates the top 100 principal 
components. 


4. Applications to large scale genomic data 

4.1 Results on WTCCC association mapping 


We test our approach that incorporates ARSVD in EMM association studies on the Wellcome Trust 


Case Control Consortium (WTCCC) data (Consortium 20071 that includes 4,684 individuals (half 


case individuals with Crohn’s disease and half control individuals without Crohn’s disease) and 
478,765 genetic variants across the 22 autosomal chromosomes in the genome. We compare our 
method ARSVD incorporated in the next generation EMM software after EMMAX (pylmm) to 
GEMMA. Performing ARSVD on the whole genome took 82.2 secs, while the traditional eigende¬ 
composition of the covariance matrix in pylmm took 88 mins 23.9 sec. In order to most accurately 
control for the test statistic computed in a EMM and achieve maximal statistical power, it is sug¬ 
gested that a covariance matrix is constructed once per (22) chromosomes, performed by holding 
out the test chromosome and concatenating the remaining chromosomes ([Yang et 2014[). Our 
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Fraction False Positives 

^ 0.0086 

^ 0.009 

^0.016 

^0.0162 

^ 0.0176 

^0.018 

^ 0.0198 

^ 0.021 

^0.03 

^ 0.0314 

^ 0.042 

^ 0.045 

^ 0.062 

^ 0.068 

^ 0.0722 

^0.1042 

^0.106 

^ 0.1622 

^0.167 

— 0.2452 
fcO .25 
^ 0.275 

— 0.311 
S 0.343 
^ 0.396 
r — 0.41 

— 0.412 
^0.418 
^ 0.452 
1)10.516 
410.611 
4 i 0.729 
^0.816 
^0.82 
^ 0.9144 
^0.919 


Figure 4: Regularization simulations. X-axis is latent rank used to project the original matrix 
into a lower subspace. Y-axis is p-values across p tests for association. Each box plot is 
colored by the fraction of false positives in the association results for a given rank. Top 
left: n = 100 and p = 100. Top right: n = 100 and p = 1, 000. Bottom left: n = 1,000 
and p = 100. Bottom right: n = 1, 000 and p = 1, 000. 


method performs the 22 decompositions in a total of 5 mins 4.8 secs, while the traditional method 
performs the same decomposition in 4 hrs 24 mins. 


GEMMA is executed with the -1mm option to most closely resemble the analysis that pylmm 
performs. Despite having discrete phenotypes (disease status {0,1}), we note that estimating coef¬ 
ficients /3 (effect sizes) with linear regression and logistic regression have nearly indistinguishable 
impacts on computing association statistics ( Zhou et’di] 20131. Estimates of (3 for pylmm and 
GEMMA have a high correlation, yet GEMMA shows no enrichment for significant p-values. On 
the other hand, the distribution of p-values estimated from pylmm is closer to what we expect from 
an analysis on true associations, showing that our method, while providing some regularization, 
does not impact enrichment of true signal. 
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Figure 5: Genome-wide association study results for pylmm and GEMMA. Panel A: (5 of every 
analyzed variant in the genome, GEMMA versus pylmm. Panel B: Histogram of p-values 
for eaeh method. 


We investigated the most significant genetic associations identified by our mefhod and compared 
fo previous resulfs from a large-scale mefa-analysis of Crohn’s disease. This mefa-analysis had 
a much larger sample size fhan fhe WTCCC: if included 6,333 affecfed individuals (cases) and 
15,056 confrols and followed up fhe lop associafion signals in 15,694 cases, 14,026 conlrols, and 
414 parenf-offspring Irios. Table [^reporls fhe curaled lisl of associated genetic varianls reporled in 
Ihis mefa-analysis Ihrough various sources ( [Lisfgarfen ef al.[ 20121. This includes genetic varianls 
identified by mefa-analysis ( Franke ef al.[ 2010| ), fhe original WTCCC sludy ( Consorfium[ 20071, or 
Ihose lhal were confirmed more recenlly via several melhods and known fo correspond lo a region 
associated wifh Crohn’s disease risk ( [Lisfgarfen ef al.[ 20121. For each analysis fype, we reporl 
fhe number of varianls found significanl using fhis analysis, fhe subsel of fhose varianls lhal lie 
wilhin fhe lop 0.5% of our resulfs, and fhe subsel of varianls lhal reach significance al a local false 
discovery rate (Ifdr) of 5% (Slrimmer 2008). 

Furlhermore, we idenlify several genetic varianls associafed wifh Crohn’s lhal were previously 
unidenlified. In parficular, Ihere are a few genefic varianls wilhin Ihe 3.6MB region lhal defines fhe 
MHC on chromosome 6 in Ihe human genome ( [sequencing consortium 19991 al an Ifdr Ihreshold 
of 10%. In particular, genetic varianl rs9269186 (p < 7.03 x 10“^) passes below an Ifdr Ihresh¬ 
old of 1%, and lies wilhin 5KB of fhe slarl of fhe HLA-DRB5 gene, pulalively acting lo regulate 
Iranscriplion of Ihis protein-coding gene lhal produces a membrane-bound class II molecule. The 
HLA-DRB5 protein is an antigen lhal has an imporlanl role in Ihe human immune system, and Ihus 
may play a role in an autoimmune disorder such as Crohn’s disease. 
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source 

significanl varianls 

top 0.5% 

Ifdr 5% 

All 

151 

61 

35 

MA -t WTCCC 

93 

48 

30 

MA 

81 

47 

29 


Table 3: Enrichment of genetic associations found in our results among related Crohn’s dis¬ 
ease studies. Number of genetic variants associated with Crohn’s disease risk identified by 
previous studies that are present in the top 0.5% of the results obtained from our method. 
MA = meta-analysis. MHC = major histocompatibility complex, which is a region pre¬ 
viously identified as associafed wifh Crohn’s and aufoimmune diseases more generally. 
WTCCC = original analysis performed on dafa. All = MT -i- MHC -i- WTCCC. 


5. Discussion 


Massive high dimensional dafa sefs are ubiquifous in modem dafa analysis sellings. In fhis paper we 
address Ihe problem of Iracfably and accuralely performing eigendecomposilions when analyzing 
such dafa. The main compulalional fool we use is based on recenf randomized algorilhms developed 
by Ihe numerical analysis communify. Taking info accounf fhe presence of noise in Ihe dafa we 
provide an adaplive melhod for eslimafion of bolh fhe rank d* and number of Krylov ileralions t* for 
Ihe randomized approximale version of SVD. Using fhis adaplive esfimalor of low-rank slruclure, 
we implemenl efficienl algorilhms for PCA and linear mixed models. Perhaps, fhe mosl inleresling 
slafislical observalion is relaled lo fhe regularization fhaf randomization implicilly imposes on fhe 
resulfing eigendecomposilion esfimafes. 

In simulated experimenls we show high accuracy in recovering fhe Irue (lalenf) rank of malrices 
wifh low-rank subslruclure, wilhoul fhe need for many Krylov iterations. Addilionally, we show in 
simulations lhal our melhod performs implicil regularization and improves quanlilalive properties 
of resulfs under various sellings of dafa slruclure. Furlhermore, our resulls on modern genome¬ 
wide association sludies show lhal our approach lo using ARSVD in linear mixed models fills a 
critical need for melhods wilh compulalional efficiency lhal do nol sacrifice Ihe desirable slafislical 
properties of Ihese lesls, bul instead have Ihe potential lo improve upon Ihese properties. Some 
imporlanl open questions still remain: 

(1) There is need for a Iheorelical framework lo quantify whal generalization guarantees Ihe ran¬ 
domization algorilhm can provide on oul-of-sample dala, and Ihe dependence of Ihis bound 
on Ihe noise and Ihe slruclure in Ihe dala on one hand and on Ihe parameter sellings on Ihe 
olher. 


(2) A probabilistic inlerprelalion of Ihe algorilhm could conlribule additional insighls into Ihe 
practical ulilily of Ihe proposed approach under differenl assumptions. In particular if would 
be interesting to relate our work to a Bayesian model wilh posterior modes lhal correspond to 
Ihe subspaces estimated by Ihe randomized approach. 


(3) The implicil regularization on lalenl factors imposed by ARSVD should be further explored 
wilh respecl to Ihe slruclure of Ihe noise, and ils impacl on estimates of random effecls in 
LMMs ( Runcie and Mukherjee[ 20131. 
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Appendix A. 

A.l Generalized eigendecomposition and dimension reduction 

The appendix states a variety of dimension reduction methods supervised, unsupervised, and non¬ 
linear can use the ARSVD engine to scale to massive data. The key requirement is a a formulation 
of the truncated generalized eigendecomposition problem that can be implemented by the Adaptive 
Randomized SVD from Section |2.4| The dimension reduction methods we will focus on are sliced 
inverse regression (SIR) and localized sliced inverse regression (ESIR). 

A. 1.1 Problem Eormulation. 

Assume we are given S G F £ SS^ that characterize pairwise relationships in the data and 

let r <C min(n, p) be the ’’intrinsic dimensionality” of the information contained in the data. In the 
case of supervised dimension reduction methods this corresponds to the dimensionality of the linear 
subspace to which the joint distribution of {X, Y) assigns non-zero probability mass. Our objective 
is to find a basis for that subspace. For SIR and ESIR this corresponds to the span of the generalized 
eigenvectors {gi, ..., pr} with largest eigenvalues {Amax = Ai < ... < A^}: 

Tg = XYg. (1) 

An important structural constraint we impose on F, which is applicable to a variety of high-dimensional 
data settings, is that it has low-rank: r < d* = rank(r) <C p. It is this constraint that we will take 
advantage of in the randomized methods. In the case of S = I (unsupervised case) r = d*. 

A. 1.2 Sufficient dimension reduction 

Dimension reduction is often a first step in the statistical analysis of high-dimensional data and 
could be followed by data visualization or predictive modeling. If the ultimate goal is the latter, 
then the statistical quantity of interest is a low dimensional summary Z = R{X) which captures all 
the predictive information in X relevant to Y : 


Y = f{X) + e = h{Z) + s, X £RP,Z £R\ r ^p. 


Sufficienl dimension reducfion (SDR) is one popular approach for eslimafing Z, which (Ei 

1991 

Cook and Weisberg[|1991 

Eij \992\ Ei el al.| 2005| Nilsson el al. 

2007[ Sugiyamal 20071 

Cook 

2007 

Wu el al. 

2010 

I. In Ibis appendix we focus on linear SDRs: G = {gi,..., g^) £ 


R{X) = X, which provide a prediction-optimal reduction of X 


(Y I X) = (Y I G'^X), = is equivalence in distribution. 
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We will consider two specific dimension reduction methods: Sliced Inverse Regression (SIR) 
( [Li 19911 and Localized Sliced Inverse Regression (LSIR) ( Wu et aL| 2010). SIR is effective 
when the predictive structure in the data is global, i.e., there is single predictive subspace over the 
support of the marginal distribution of X. In the case of local or manifold predictive structure in the 
data, LSIR can be used to compute a projection matrix G that contains this non-linear (manifold) 
structure. 


A. 1.3 Efficient solutions and approximate SVD 

SIR and LSIR reduce to solving a truncated generalized eigendecomposition problem in ([T]). Since 
we consider estimating the dimension reduction based on sample data we focus on the sample esti¬ 
mators S = ^X'^X and Txy = X"^KxyX, where Kxy is symmetric and encodes the method- 
specific grouping of the samples based on the response Y. In the classic statistical setting, when 
n > p, both S and Txy are positive definite almost surely. Then, a typical solution proceeds by first 
sphering the data: Z = T,~ 2 X, e.g., using a C holesky or SVD represen tation E = Sa (Ea )^. This 


is followed by eigendecomposition of T^y ^ (19911; 


Wu et al. 


( |2010[ ) and back-transformation of 
the top eigenvectors directions to the canonical basis. The computational time is 0{np^). When 
n < p, E and f are rank-deficient and a unique solution to the problem ([^ does not exist. One 
widely-used approach, which allows us to make progress in this problematic setting, is to restrict 
our attention to the directions in the data with positive variance. Then we can proceed as before, 
using an orthogonal projections onto the span of the data. The total computation time in this case is 
O(n^p). In many modern data analysis applications both n and p are very large, and hence algorith¬ 
mic complexity of (9[max(n, p) x min(n, p)^] could be prohibitive, rendering the above approaches 
unusable. We propose an approximate solution that explicitly recovers the low-rank structure in T 


using Adaptive Randomized SVD from Section 2.4 In particular, assume rank (T) = d* > r (where 

r is the dimensionality of the optimal dimension reduction subspace). Then T US'^U'^, where 
U G The generalized eigendecomposition problem Q solution becomes restricted to the 

subspace spanned by the columns of T: 


s-^u'^ms-^e = ye, e = SU^g. 
A 


( 2 ) 


The dimension reduction subspace is contained in the span(G), where G = {US ^ei,... ,US 
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